We have joined a machine learning competition on the DrivenData website, whose goal is to predict how likely individuals are to receive their H1N1 and seasonal flu vaccines. We are predicting two probabilities: one for the H1N1 vaccine and one for the seasonal flu vaccine.
We have three datasets: the training set features, the training set labels, and the test set features.
Additionally, we have a sample submission file, which is necessary to successfully format our submission.
The full documentation of the dataset is available on the competition's official page.
Before starting the analysis, we import the necessary libraries:
import pandas as pd
import plotly.express as px
import plotly
plotly.offline.init_notebook_mode()
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
sns.set(rc={'figure.figsize':(14,10)})
Let's take a look at our training features:
# Training Features
train_features = pd.read_csv('training_set_features.csv')
train_features.head()
# Test Features
test_features = pd.read_csv('test_set_features.csv')
test_features.head()
Since we want to scale and encode our data consistently, we concatenate the training and test features so that the same transformations are applied to both:
full_data = pd.concat([train_features, test_features], axis=0)
full_data.head()
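Because `pd.concat` with `axis=0` simply stacks the rows, the original training and test rows can later be recovered by positional slicing. A minimal sketch with hypothetical toy frames:

```python
import pandas as pd

# Toy frames standing in for the real feature sets
train = pd.DataFrame({'a': [1, 2, 3]})
test = pd.DataFrame({'a': [4, 5]})

full = pd.concat([train, test], axis=0)  # rows stacked; index values repeat

# Positional slicing recovers the original split
back_train = full.iloc[:len(train)]
back_test = full.iloc[len(train):]
print(back_train['a'].tolist(), back_test['a'].tolist())
```

Note that `.iloc` is used here to make the slicing explicitly positional, since the stacked frame has duplicate index values.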
Since our task is to predict two different probabilities, we must split our data into two different DataFrames.
We can identify the columns specific to each vaccine by looking at the column names:
full_data.columns
# Choosing columns based on the vaccine name
h1n1_cols = [col for col in full_data.columns if 'h1n1' in col]
seas_cols = [col for col in full_data.columns if 'seas' in col]
rest_cols = [col for col in full_data.columns if col not in (h1n1_cols+seas_cols)]
h1n1_cols
seas_cols
rest_cols
Now that we have isolated the columns related to each vaccine, as well as the common columns, we can create two different DataFrames, one per vaccine:
# Data for each vaccine
h1n1_data = full_data[h1n1_cols + rest_cols]
seas_data = full_data[seas_cols + rest_cols]
We have numerical and categorical data in both datasets. We want to discard high-cardinality categorical variables, because it is not easy to make predictions with them. We'll keep only categorical columns with fewer than 10 unique values.
# Keeping low cardinality columns
h1n1_categorical_cols = [col for col in h1n1_data.columns if h1n1_data[col].nunique() < 10 and h1n1_data[col].dtype == 'object']
seas_categorical_cols = [col for col in seas_data.columns if seas_data[col].nunique() < 10 and seas_data[col].dtype == 'object']
# Numerical columns
h1n1_numerical_cols = [col for col in h1n1_data.columns if h1n1_data[col].dtype in ['float64', 'int64']]
seas_numerical_cols = [col for col in seas_data.columns if seas_data[col].dtype in ['float64', 'int64']]
# Columns to keep
h1n1_data = h1n1_data[h1n1_categorical_cols + h1n1_numerical_cols]
seas_data = seas_data[seas_categorical_cols + seas_numerical_cols]
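The cardinality filter above can be illustrated on a small hypothetical frame, where only low-cardinality object columns pass:

```python
import pandas as pd

df = pd.DataFrame({
    'sex':    ['Male', 'Female', 'Female'],  # object, 2 unique values -> kept
    'city':   ['NY', 'LA', 'SF'],            # object, too many unique values -> dropped
    'rating': [1.0, 2.0, 3.0],               # numerical -> handled separately
})

# Threshold of 3 for this toy example (the notebook uses 10)
low_card = [c for c in df.columns
            if df[c].nunique() < 3 and df[c].dtype == 'object']
print(low_card)
```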
# h1n1 data, missings
h1n1_data.info()
# seas data, missings
seas_data.info()
Now we have to deal with missing data. The column with the most missing values is income_poverty, with around 4,500 missing values, roughly 20% of the training data. That is too much to simply drop the affected rows, so we drop this column instead. For the remaining columns, we fill missing values with the most common value.
# Dropping income_poverty
h1n1_data.drop('income_poverty', axis=1, inplace=True)
seas_data.drop('income_poverty', axis=1, inplace=True)
# Filling each column with the most common value
for col in h1n1_data.columns:
    h1n1_data[col] = h1n1_data[col].fillna(h1n1_data[col].mode()[0])
for col in seas_data.columns:
    seas_data[col] = seas_data[col].fillna(seas_data[col].mode()[0])
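Note that `mode()` returns a Series (there may be ties), so `mode()[0]` picks the first most frequent value. It also ignores NaN by default, which is exactly what we need here. A minimal sketch:

```python
import pandas as pd

s = pd.Series(['a', None, 'b', 'a'])
filled = s.fillna(s.mode()[0])  # mode() ignores the NaN, so the mode is 'a'
print(filled.tolist())
```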
Now that we have all the data together, it's a good idea to encode the categorical variables before definitively splitting the data back into training and test sets, so that both sets share the same encoding (we need this step before the train test split).
# Label Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for column in h1n1_data.columns:
    if h1n1_data[column].dtype == 'object':
        h1n1_data[column] = le.fit_transform(h1n1_data[column])
for column in seas_data.columns:
    if seas_data[column].dtype == 'object':
        seas_data[column] = le.fit_transform(seas_data[column])
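`LabelEncoder` assigns integer codes following the alphabetical order of the observed classes, and since we call `fit_transform` per column, each column gets its own independent mapping. A small sketch of the behavior:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['no', 'yes', 'no', 'maybe'])
# classes_ are sorted alphabetically: ['maybe', 'no', 'yes']
print(list(le.classes_), list(codes))
```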
# Splitting test and training data
h1n1_train_data = h1n1_data[:len(train_features)]
h1n1_test_data = h1n1_data[len(train_features):]
seas_train_data = seas_data[:len(train_features)]
seas_test_data = seas_data[len(train_features):]
# Target and features
targets = pd.read_csv('training_set_labels.csv')
h1n1_target = targets[['respondent_id', 'h1n1_vaccine']]
seas_target = targets[['respondent_id', 'seasonal_vaccine']]
h1n1_full = h1n1_train_data.merge(h1n1_target, on='respondent_id')
seas_full = seas_train_data.merge(seas_target, on='respondent_id')
# Train Test Split
from sklearn.model_selection import train_test_split
# Features
X_h1n1 = h1n1_full.drop(['respondent_id', 'h1n1_vaccine'], axis=1)
X_seas = seas_full.drop(['respondent_id', 'seasonal_vaccine'], axis=1)
# Targets
y_h1n1 = h1n1_full['h1n1_vaccine']
y_seas = seas_full['seasonal_vaccine']
# Splitting the data into training and validation data
X_train_h1n1, X_test_h1n1, y_train_h1n1, y_test_h1n1 = train_test_split(
X_h1n1, y_h1n1, test_size=0.25, random_state=14)
X_train_seas, X_test_seas, y_train_seas, y_test_seas = train_test_split(
X_seas, y_seas, test_size=0.25, random_state=14)
# Scaling the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Here we also scale the competition test data, not only the training and validation data
X_train_h1n1_scaled = scaler.fit_transform(X_train_h1n1) # train data
X_test_h1n1_scaled = scaler.transform(X_test_h1n1) # validation data
h1n1_test_data = h1n1_test_data.drop('respondent_id', axis=1)
h1n1_test_data_scaled = scaler.transform(h1n1_test_data) # test data
X_train_seas_scaled = scaler.fit_transform(X_train_seas) # train data
X_test_seas_scaled = scaler.transform(X_test_seas) # validation data
seas_test_data = seas_test_data.drop('respondent_id', axis=1)
seas_test_data_scaled = scaler.transform(seas_test_data) # test data
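The key point in the scaling step is that `fit_transform` learns the mean and standard deviation from the training data only, while `transform` reuses those statistics on the validation and test data. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[1.0], [2.0], [3.0]])
X_va = np.array([[2.0], [4.0]])

sc = StandardScaler()
X_tr_s = sc.fit_transform(X_tr)  # learns mean=2 and std from the train data
X_va_s = sc.transform(X_va)      # reuses the train statistics, no refitting
print(X_va_s.ravel())
```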
We want to train two kinds of models: a gradient boosting model and a neural network.
from sklearn import metrics
# In this dictionary we will store the scores of each model
performance = {}
# Function to plot the ROC AUC Curve of each model
def roc_auc(model, vaccine=None):
    '''Train a model, make predictions on the validation data,
    plot the ROC curve, and store the ROC AUC score.
    Args:
        - model: the model used to make predictions
        - vaccine: 'h1n1' or 'seasonal'
    '''
    if vaccine == 'h1n1':
        X_train = X_train_h1n1_scaled
        X_test = X_test_h1n1_scaled
        y_train = y_train_h1n1
        y_test = y_test_h1n1
    else:
        X_train = X_train_seas_scaled
        X_test = X_test_seas_scaled
        y_train = y_train_seas
        y_test = y_test_seas
    # Training the model
    model.fit(X_train, y_train)
    # Predictions on the validation data
    preds = model.predict_proba(X_test)
    # Plot ROC Curve
    fpr, tpr, thresholds = metrics.roc_curve(y_test, preds[:, 1])
    roc_auc = metrics.auc(fpr, tpr)
    display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
                                      estimator_name=model.__class__.__name__)
    display.plot()
    plt.show()
    # ROC AUC Score
    score = metrics.roc_auc_score(y_test, preds[:, 1])
    print(f'ROC AUC Score {model.__class__.__name__}:', score)
    # Storing the score in the dictionary
    if vaccine == 'h1n1':
        performance[model.__class__.__name__ + '_H1N1'] = score
    else:
        performance[model.__class__.__name__ + '_Seasonal'] = score
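Inside the helper, the area computed with `metrics.auc` over the ROC curve is the same number returned by `metrics.roc_auc_score` directly. A quick check with made-up scores:

```python
from sklearn import metrics

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, _ = metrics.roc_curve(y_true, y_score)
auc_from_curve = metrics.auc(fpr, tpr)      # area under the ROC curve
score = metrics.roc_auc_score(y_true, y_score)  # same value, computed directly
print(auc_from_curve, score)
```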
# Boosting - H1N1
from xgboost import XGBClassifier
model_1_h1n1 = XGBClassifier()
# ROC Curve and ROC AUC
roc_auc(model_1_h1n1, 'h1n1')
# Boosting - Seasonal
model_1_seas = XGBClassifier()
# ROC Curve and ROC AUC
roc_auc(model_1_seas, 'seasonal')
# NN - H1N1
input_shape = [X_train_h1n1_scaled.shape[1]]
from tensorflow import keras
from tensorflow.keras import layers
# Neural Network
nn1 = keras.Sequential([
    layers.BatchNormalization(input_shape=input_shape),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')
])
# Optimizer, Loss Function and Metric
nn1.compile(
optimizer='adam',
loss='binary_crossentropy',
metrics=[keras.metrics.AUC()]
)
# Early Stopping
early_stopping = keras.callbacks.EarlyStopping(
patience=5,
min_delta=0.001,
restore_best_weights=True,
)
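The interplay of `patience` and `min_delta` can be sketched in plain Python. This is a simplified model of Keras's `EarlyStopping` monitoring the validation loss (it ignores `restore_best_weights`), not the actual implementation:

```python
def stopped_epoch(val_losses, patience=5, min_delta=0.001):
    """Return the 0-based epoch at which training would stop,
    or the last epoch if early stopping never triggers (simplified sketch)."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:  # an improvement of at least min_delta
            best = loss
            wait = 0
        else:
            wait += 1                # no meaningful improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses) - 1

# The loss plateaus after epoch 2, so with patience=3 training stops at epoch 5
print(stopped_epoch([1.0, 0.9, 0.89, 0.89, 0.89, 0.89, 0.89], patience=3))
```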
# Training
history = nn1.fit(
X_train_h1n1_scaled, y_train_h1n1,
validation_data=(X_test_h1n1_scaled, y_test_h1n1),
batch_size=64,
epochs=200,
callbacks=[early_stopping],
)
# Results
history_df = pd.DataFrame(history.history)
history_df.iloc[:, [0, 2]].plot(title="Cross-entropy")
history_df.iloc[:, [1, 3]].plot(title="AUC")
preds = nn1.predict(X_test_h1n1_scaled)
performance['NN_H1N1'] = metrics.roc_auc_score(y_test_h1n1, preds.ravel())
# Plot ROC Curve
fpr, tpr, thresholds = metrics.roc_curve(y_test_h1n1, preds)
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
                                  estimator_name='Neural Network')
display.plot()
plt.show()
# NN - Seasonal
input_shape = [X_train_seas_scaled.shape[1]]
# Neural Network
nn2 = keras.Sequential([
    layers.BatchNormalization(input_shape=input_shape),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')
])
# Optimizer, Loss Function and Metric
nn2.compile(
optimizer='adam',
loss='binary_crossentropy',
metrics=[keras.metrics.AUC()]
)
# Early Stopping
early_stopping = keras.callbacks.EarlyStopping(
patience=5,
min_delta=0.001,
restore_best_weights=True,
)
# Training
history = nn2.fit(
X_train_seas_scaled, y_train_seas,
validation_data=(X_test_seas_scaled, y_test_seas),
batch_size=64,
epochs=100,
callbacks=[early_stopping],
)
# Results
history_df = pd.DataFrame(history.history)
history_df.iloc[:, [0, 2]].plot(title="Cross-entropy")
history_df.iloc[:, [1, 3]].plot(title="AUC")
preds = nn2.predict(X_test_seas_scaled)
performance['NN_Seasonal'] = metrics.roc_auc_score(y_test_seas, preds.ravel())
# Plot ROC Curve
fpr, tpr, thresholds = metrics.roc_curve(y_test_seas, preds)
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
                                  estimator_name='Neural Network')
display.plot()
plt.show()
Let's check our results:
# Results DataFrame
scores_df = pd.DataFrame(performance, index=['ROC AUC Score']).T.sort_values('ROC AUC Score', ascending=False)
scores_df['Vaccine'] = ['Seasonal', 'Seasonal', 'H1N1', 'H1N1']
scores_df['model'] = scores_df.index
scores_df = scores_df.reset_index(drop=True)
scores_df
fig = px.bar(scores_df, x='model', y='ROC AUC Score', color='Vaccine',
             title='Obtained Scores')
fig.update_layout(height=400, width=900, template='plotly_white',
autosize=False, showlegend=False, yaxis_range=[0.7, 0.85])
fig.show()
The best model for both vaccines was the neural network.
# Predictions
# 1. H1N1 Vaccine
h1n1_preds = nn1.predict(h1n1_test_data_scaled)
# 2. Seasonal Vaccine
seas_preds = nn2.predict(seas_test_data_scaled)
pd.Series(h1n1_preds.ravel())
# Labels - respondent_id
resp_ids = test_features['respondent_id']
h1n1_vaccine = pd.Series(h1n1_preds.ravel())
seas_vaccine = pd.Series(seas_preds.ravel())
# Submission DataFrame
submission = pd.concat([resp_ids, h1n1_vaccine, seas_vaccine], axis=1)
submission = submission.rename(columns={
0: 'h1n1_vaccine', 1: 'seasonal_vaccine'})
submission
# Exporting the Data as CSV to make the submission
submission.to_csv('submission.csv', index=False)
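Before uploading, it is worth sanity-checking the submission: the columns should match the expected names, and every prediction should be a probability in [0, 1]. A sketch with toy values standing in for the real DataFrame:

```python
import pandas as pd

# Toy stand-in for the real submission DataFrame
submission = pd.DataFrame({
    'respondent_id': [0, 1],
    'h1n1_vaccine': [0.31, 0.82],
    'seasonal_vaccine': [0.55, 0.12],
})

cols_ok = list(submission.columns) == ['respondent_id', 'h1n1_vaccine', 'seasonal_vaccine']
probs_ok = submission[['h1n1_vaccine', 'seasonal_vaccine']].stack().between(0, 1).all()
print(cols_ok, probs_ok)
```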
We reached a score of 0.83, which places us in the top 25% worldwide. The best global scores are around 0.86 ROC AUC.